Abstract


While Recurrent Neural Networks 
(RNNs) are famously known to be Turing 
complete, this relies on infinite precision 
in the states and unbounded computation 
time. We consider the case of RNNs with 
finite precision whose computation time 
is linear in the input length. Under these 
limitations, we show that different RNN 
variants have different computational 
power. In particular, we show that the 
LSTM and the Elman-RNN with ReLU 
activation are strictly stronger than the 
RNN with a squashing activation and the 
GRU. This is achieved because LSTMs 
and ReLU-RNNs can easily implement 
counting behavior. We show empirically 
that the LSTM does indeed learn to 
effectively use the counting mechanism. 


1 Introduction 


Recurrent Neural Networks (RNNs) have emerged as
very strong learners of sequential data. A famous
result by Siegelmann and Sontag (1992; 1994), 
and its extension in (Siegelmann, 1999), demon- 
strates that an Elman-RNN (Elman, 1990) with a 
sigmoid activation function, rational weights and 
infinite precision states can simulate a Turing- 
machine in real-time, making RNNs Turing- 
complete. Recently, Chen et al (2017) extended 
the result to the ReLU activation function. How- 
ever, these constructions (a) assume reading the 
entire input into the RNN state and only then per- 
forming the computation, using unbounded time; 
and (b) rely on having infinite precision in the 
network states. As argued by Chen et al (2017), 
this is not the model of RNN computation used in 
NLP applications. Instead, RNNs are often used 
by feeding an input sequence into the RNN one 
item at a time, each immediately returning a state- 


vector that corresponds to a prefix of the sequence 
and which can be passed as input for a subse- 
quent feed-forward prediction network operating 
in constant time. The amount of tape used by a
Turing machine under this restriction is linear in
the input length, reducing its power to the recognition
of context-sensitive languages. More importantly,
computation is often performed on GPUs
with 32-bit floating point computation, and there is
increasing evidence that competitive performance 
can be achieved also for quantized networks with 
4-bit weights or fixed-point arithmetics (Hubara 
et al., 2016). The construction of Siegelmann (1999)
implements pushing 0 into a binary stack by
the operation $g \leftarrow g/4 + 1/4$. This allows pushing
roughly 15 zeros before reaching the limit of
the 32-bit floating point precision. Finally, RNN
solutions that rely on carefully orchestrated math- 
ematical constructions are unlikely to be found us- 
ing backpropagation-based training. 
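
The precision limit can be checked directly. The following is a minimal sketch, assuming that IEEE 754 single precision is emulated by round-tripping values through Python's standard `struct` module; the helper names `to_float32` and `pushes_until_stall` are illustrative and are not part of the cited construction. It repeatedly applies the push-0 update $g \leftarrow g/4 + 1/4$ and counts how many pushes still change the stored value.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (double) to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def pushes_until_stall(g: float = 0.0, limit: int = 1000) -> int:
    """Count how many push-0 updates g <- g/4 + 1/4 still change a float32 state."""
    g = to_float32(g)
    for count in range(limit):
        new_g = to_float32(g / 4 + 0.25)
        if new_g == g:      # the state can no longer record another push
            return count
        g = new_g
    return limit

print(pushes_until_stall())
```

Under these assumptions the state stops changing after a number of pushes of the same order as the estimate above; past that point further pushes are silently lost.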


In this work we restrict ourselves to input-bound
recurrent neural networks with finite-precision
states (IBFP-RNN), trained using back-propagation.
This class of networks is likely to coincide
with the networks one can expect to obtain
when training RNNs for NLP applications. An
IBFP Elman-RNN is finite state. But what about 
other RNN variants? In particular, we consider the 
Elman RNN (SRNN) (Elman, 1990) with squash- 
ing and with ReLU activations, the Long Short- 
Term Memory (LSTM) (Hochreiter and Schmid- 
huber, 1997) and the Gated Recurrent Unit (GRU) 
(Cho et al., 2014; Chung et al., 2014). 


The common wisdom is that the LSTM and 
GRU introduce additional gating components that 
handle the vanishing gradients problem of train- 
ing SRNNs, thus stabilizing training and making 
it more robust. The LSTM and GRU are often 
considered as almost equivalent variants of each 
other. 
Figure 1: Activations ($c$ for LSTM and $h$ for GRU) for networks trained on $a^n b^n$ and $a^n b^n c^n$. The
LSTM has clearly learned to use an explicit counting mechanism, in contrast with the GRU.


We show that in the input-bound, finite- 
precision case, there is a real difference between 
the computational capacities of the LSTM and the 
GRU: the LSTM can easily perform unbounded 
counting, while the GRU (and the SRNN) can- 
not. This makes the LSTM a variant of a k-counter 
machine (Fischer et al., 1968), while the GRU re- 
mains finite-state. Interestingly, the SRNN with 
ReLU activation followed by an MLP classifier 
also has power similar to a k-counter machine. 

These results suggest there is a class of formal 
languages that can be recognized by LSTMs but 
not by GRUs. In section 5, we demonstrate that 
for at least two such languages, the LSTM man- 
ages to learn the desired concept classes using 
back-propagation, while using the hypothesized 
control structure. Figure 1 shows the activations
of 10-d LSTM and GRU trained to recognize the
languages $a^n b^n$ and $a^n b^n c^n$. It is clear that the
LSTM learned to dedicate specific dimensions for
counting, in contrast to the GRU.¹


¹ Is the ability to perform unbounded counting relevant to
“real world” NLP tasks? In some cases it might be. For ex- 
ample, processing linearized parse trees (Vinyals et al., 2015; 
Choe and Charniak, 2016; Aharoni and Goldberg, 2017) re- 
quires counting brackets and nesting levels. Indeed, previous 
works that process linearized parse trees report using LSTMs 
and not GRUs for this purpose. Our work here suggests that 
this may not be a coincidence. 


2 The RNN Models 


An RNN is a parameterized function $R$ that takes
as input an input vector $x_t$ and a state vector $h_{t-1}$
and returns a state vector $h_t$:


$h_t = R(x_t, h_{t-1})$ (1)


The RNN is applied to a sequence $x_1, \ldots, x_n$
by starting with an initial vector $h_0$ (often the 0
vector) and applying $R$ repeatedly according to
equation (1). Let $\Sigma$ be an input vocabulary (alphabet),
and assume a mapping $E$ from every vocabulary
item to a vector $x$ (achieved through a 1-hot
encoding, an embedding layer, or some other
means). Let $RNN(x_1, \ldots, x_n)$ denote the state
vector $h$ resulting from the application of $R$ to
the sequence $E(x_1), \ldots, E(x_n)$. An RNN recognizer
(or RNN acceptor) has an additional function
$f$ mapping states $h$ to $\{0, 1\}$. Typically, $f$ is a
log-linear classifier or multi-layer perceptron. We
say that an RNN recognizes a language $L \subseteq \Sigma^*$
if $f(RNN(w))$ returns 1 for all and only words
$w = x_1, \ldots, x_n \in L$.
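
To make the recognizer definition concrete, here is a minimal sketch in Python. The function names `run_rnn` and `recognizes`, and the toy parity instance at the end, are hypothetical illustrations and not networks from this work; `R`, `E`, `h0` and `f` play the roles of the symbols defined above.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]

def run_rnn(R: Callable[[Vector, Vector], Vector],
            E: Dict[str, Vector],
            h0: Vector,
            word: Sequence[str]) -> Vector:
    """Apply h_t = R(x_t, h_{t-1}) over the embedded symbols of `word`."""
    h = h0
    for symbol in word:
        h = R(E[symbol], h)
    return h

def recognizes(R, E, h0, f: Callable[[Vector], int], word: Sequence[str]) -> bool:
    """An RNN recognizer accepts `word` iff f(RNN(word)) == 1."""
    return f(run_rnn(R, E, h0, word)) == 1

# Toy instance (hypothetical): a one-dimensional state tracking the parity of 'a's.
E = {"a": [1.0], "b": [0.0]}
R = lambda x, h: [abs(h[0] - x[0])]        # flips the state on 'a', keeps it on 'b'
f = lambda h: 1 if h[0] == 0.0 else 0      # accept words with an even number of 'a's

print(recognizes(R, E, [0.0], f, "abab"))  # True
```

The state is updated once per input symbol and the classifier $f$ is applied only to the final state, which is the input-bound setting considered here.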


Elman-RNN (SRNN) In the Elman-RNN (El- 
man, 1990), also called the Simple RNN (SRNN), 
the function R takes the form of an affine trans- 
form followed by a tanh nonlinearity: 
$h_t = \tanh(W x_t + U h_{t-1} + b)$ (2)


Elman-RNNs are known to be at-least finite- 
state. Siegelmann (1996) proved that the tanh can 
be replaced by any other squashing function with- 
out sacrificing computational power. 


IRNN The IRNN model, explored by (Le et al., 
2015), replaces the tanh activation with a non- 
squashing ReLU: 


$h_t = \max(0, W x_t + U h_{t-1} + b)$ (3)


The computational power of such RNNs (given in- 
finite precision) is explored in (Chen et al., 2017). 


Gated Recurrent Unit (GRU) In the GRU 
(Cho et al., 2014), the function R incorporates a 
gating mechanism, taking the form: 


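$z_t = \sigma(W^z x_t + U^z h_{t-1} + b^z)$ (4)
$r_t = \sigma(W^r x_t + U^r h_{t-1} + b^r)$ (5)
$\tilde{h}_t = \tanh(W^h x_t + U^h (r_t \circ h_{t-1}) + b^h)$ (6)
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$ (7)


where $\sigma$ is the sigmoid function and $\circ$ is the
Hadamard product (element-wise product).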


Long Short Term Memory (LSTM) In the 
LSTM (Hochreiter and Schmidhuber, 1997), R 
uses a different gating component configuration: 


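$f_t = \sigma(W^f x_t + U^f h_{t-1} + b^f)$ (8)
$i_t = \sigma(W^i x_t + U^i h_{t-1} + b^i)$ (9)
$o_t = \sigma(W^o x_t + U^o h_{t-1} + b^o)$ (10)
$\tilde{c}_t = \tanh(W^c x_t + U^c h_{t-1} + b^c)$ (11)
$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$ (12)
$h_t = o_t \circ g(c_t)$ (13)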


where g can be either tanh or the identity. 


Equivalences The GRU and LSTM are at least
as strong as the SRNN: by setting the gates of the
GRU to $z_t = 0$ and $r_t = 1$ we obtain the SRNN
computation. The same holds for the LSTM when its gates
are set to $i_t = 1$, $o_t = 1$, and $f_t = 0$. This is easily
achieved by setting the gate matrices $W$ and $U$ to 0,
and the gate biases $b$ to the (constant) desired gate values.

Thus, all the above RNNs can recognize finite- 
state languages. 
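
The reduction can be checked numerically on a single dimension. The sketch below is illustrative only: it uses scalar weights, fixes the gate values by hand instead of computing them through a sigmoid, and takes $g$ to be the identity in the LSTM; the function names are hypothetical.

```python
import math

def srnn_step(x, h, W, U, b):
    """Scalar SRNN step, equation (2)."""
    return math.tanh(W * x + U * h + b)

def gru_step(x, h, W, U, b, z, r):
    """Scalar GRU step, equations (6)-(7), with the gate values z and r fixed by hand."""
    h_tilde = math.tanh(W * x + U * (r * h) + b)
    return z * h + (1 - z) * h_tilde

def lstm_step(x, h, c, W, U, b, i, f, o):
    """Scalar LSTM step, equations (11)-(13), with fixed gates and g = identity."""
    c_tilde = math.tanh(W * x + U * h + b)
    c_new = f * c + i * c_tilde
    return o * c_new, c_new

W, U, b = 0.5, -0.3, 0.1          # arbitrary illustrative parameters
x, h = 1.0, 0.7
target = srnn_step(x, h, W, U, b)
print(gru_step(x, h, W, U, b, z=0.0, r=1.0) == target)
print(lstm_step(x, h, 5.0, W, U, b, i=1.0, f=0.0, o=1.0)[0] == target)
```

With $z_t = 0$ and $r_t = 1$ the GRU output is exactly the proposal $\tilde{h}_t$, and with $i_t = o_t = 1$, $f_t = 0$ the LSTM discards its previous cell value, so both steps coincide with the SRNN step.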


3 Power of Counting 


Power beyond finite state can be obtained by in- 
troducing counters. Counting languages and k- 
counter machines are discussed in depth in (Fis- 
cher et al., 1968). When unbounded computa- 
tion is allowed, a 2-counter machine has Turing 
power. However, for computation bound by in- 
put length (real-time) there is a more interesting 
hierarchy. In particular, real-time counting languages
cut across the traditional Chomsky hierarchy:
real-time k-counter machines can recognize
at least one context-free language ($a^n b^n$), and at
least one context-sensitive one ($a^n b^n c^n$). However,
they cannot recognize the context-free language
given by the grammar $S \to x \mid aSa \mid bSb$
(palindromes).


SKCM For our purposes, we consider a simplified
variant of k-counter machines (SKCM).
A counter is a device which can be incremented
by a fixed amount (INC), decremented by a fixed
amount (DEC) or compared to 0 (COMP0).
Informally,² an SKCM is a finite-state automaton
extended with k counters, where at each step of
the computation each counter can be incremented,
decremented or ignored in an input-dependent
way, and state-transitions and accept/reject decisions
can inspect the counters’ states using
COMP0. The results for the three languages discussed
above hold for the SKCM variant as well,
with proofs provided in Appendix A.


4 RNNs as SKCMs 


In what follows, we consider the effect of the
state-update equations on a single dimension,
$h_t[j]$. We omit the index $[j]$ for readability.


LSTM The LSTM acts as an SKCM by designating
k dimensions of the memory cell $c_t$ as
counters. In non-counting steps, set $i_t = 0$, $f_t = 1$
through equations (8-9). In counting steps, the
counter direction (+1 or -1) is set in $\tilde{c}_t$ (equation
11) based on the input $x_t$ and state $h_{t-1}$. The
counting itself is performed in equation (12), after
setting $i_t = f_t = 1$. The counter can be reset
to 0 by setting $i_t = f_t = 0$.

Finally, the counter values are exposed through
$h_t = o_t \circ g(c_t)$, making it trivial to compare the
counter’s value to 0.³

² Formal definition is given in Appendix A.

³ Some further remarks on the LSTM: LSTM supports
both increment and decrement in a single dimension. The
counting dimensions in $c_t$ are exposed through a function
We note that this implementation of the SKCM
operations is achieved by saturating the activa- 
tions to their boundaries, making it relatively easy 
to reach and maintain in practice. 
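
The counting scheme can be sketched on a single cell dimension as follows. This is an idealized illustration, assuming the gates are exactly 0 or 1 and $\tilde{c}_t$ is exactly $\pm 1$ (a trained network only approaches these saturated values); the function names are hypothetical.

```python
def lstm_counter_step(c_prev, i, f, c_tilde):
    """Equation (12) on one dimension: c_t = f_t * c_{t-1} + i_t * c~_t."""
    return f * c_prev + i * c_tilde

def count_a_minus_b(word):
    """Use one cell dimension as a counter: +1 on 'a', -1 on 'b', unchanged otherwise."""
    c = 0.0
    for symbol in word:
        if symbol == "a":
            c = lstm_counter_step(c, i=1.0, f=1.0, c_tilde=+1.0)   # INC
        elif symbol == "b":
            c = lstm_counter_step(c, i=1.0, f=1.0, c_tilde=-1.0)   # DEC
        else:
            c = lstm_counter_step(c, i=0.0, f=1.0, c_tilde=0.0)    # non-counting step
    return c

print(count_a_minus_b("a" * 1000 + "b" * 1000) == 0.0)
```

Because the cell update is additive and not squashed, the counter grows linearly with the number of increments, and a reset is obtained in one step with `i = f = 0`.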


SRNN The finite-precision SRNN cannot desig- 
nate unbounded counting dimensions. 
The SRNN update equation is:


$h_t = \tanh(W x_t + U h_{t-1} + b)$


$h_t[i] = \tanh\left(\sum_{j=1}^{d_x} W_{ij}\, x_t[j] + \sum_{j=1}^{d_h} U_{ij}\, h_{t-1}[j] + b[i]\right)$


By properly setting $U$ and $W$, one can get certain
dimensions of $h$ to update according to the
value of $x$, by $h_t[i] = \tanh(h_{t-1}[i] + w_i x + b[i])$.
However, this counting behavior is within a tanh 
activation. Theoretically, this means unbounded 
counting cannot be achieved without infinite pre- 
cision. Practically, this makes the counting be- 
havior inherently unstable, and bounded to a rel- 
atively narrow region. While the network could 
adapt to set w to be small enough such that count- 
ing works for the needed range seen in training 
without overflowing the tanh, attempting to count 
to larger n will quickly leave this safe region and 
diverge. 
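
The effect of the squashing can be seen by iterating the single-dimension update directly. The sketch below is illustrative, assuming a constant input contribution `w` and zero bias; `tanh_counter` is a hypothetical helper name.

```python
import math

def tanh_counter(n_steps, w=1.0, h=0.0):
    """Iterate h <- tanh(h + w), the single-dimension SRNN 'counter'."""
    for _ in range(n_steps):
        h = math.tanh(h + w)
    return h

for n in (1, 2, 5, 50, 1000):
    print(n, tanh_counter(n))
```

The first few counts produce clearly distinct values, but the state quickly approaches a fixed point below 1, so large counts become indistinguishable at any finite precision.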


IRNN Finite-precision IRNNs can perform unbounded
counting conditioned on input symbols.
This requires representing each counter as two dimensions,
and implementing INC as incrementing
one dimension, DEC as incrementing the other,
and COMP0 as comparing their difference to 0. Indeed,
Appendix A in (Chen et al., 2017) provides
concrete IRNNs for recognizing the languages
$a^n b^n$ and $a^n b^n c^n$. This makes the IBFP-RNN with
ReLU activation more powerful than the IBFP-RNN
with a squashing activation. Practically, ReLU-activated
RNNs are known to be notoriously hard

$g$. For both $g(x) = x$ and $g(x) = \tanh(x)$, it is trivial
to compare to 0. Another operation of interest is comparing
two counters (for example, checking the difference between
them). This cannot be reliably achieved with $g(x) = \tanh(x)$,
due to the non-linearity and saturation properties
of the tanh function, but is possible in the $g(x) = x$ case.
The LSTM can also easily set the value of a counter to 0 in one
step. The ability to set the counter to 0 gives slightly more
power for real-time recognition, as discussed by Fischer et al.
(1968).

Relation to known architectural variants: Adding peephole
connections (Gers and Schmidhuber, 2000) essentially
sets $g(x) = x$ and allows comparing counters in a stable
way. Coupling the input and the forget gates ($i_t = 1 - f_t$)
(Greff et al., 2017) removes the single-dimension unbounded
counting ability, as discussed for the GRU.


to train because of the exploding gradient prob- 
lem. 
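
The two-dimension scheme can be sketched as follows. The weights are set by hand for illustration (identity recurrence on the two counting dimensions, 1-hot inputs) and this is not the construction of Chen et al. (2017); the function names are hypothetical.

```python
def relu(v):
    return max(0.0, v)

def irnn_counter(word):
    """Two ReLU dimensions: `up` counts 'a' symbols, `down` counts 'b' symbols."""
    up, down = 0.0, 0.0
    for symbol in word:
        x_a = 1.0 if symbol == "a" else 0.0   # 1-hot input features
        x_b = 1.0 if symbol == "b" else 0.0
        up = relu(up + x_a)       # INC: increment the first dimension
        down = relu(down + x_b)   # DEC: increment the second dimension
    return up, down

def comp0(up, down):
    """COMP0: the counter is zero iff the two dimensions agree."""
    return up - down == 0.0

print(comp0(*irnn_counter("aaabbb")))
```

Two dimensions are needed because a ReLU state cannot go below zero, so decrements are accumulated separately and only the difference is compared to 0.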


GRU Finite-precision GRUs cannot implement
unbounded counting on a given dimension. The
tanh in equation (6) combined with the interpolation
(tying $z_t$ and $1 - z_t$) in equation (7) restricts
the range of values in $h$ to between -1 and 1, precluding
unbounded counting with finite precision.
Practically, the GRU can learn to count up to some
bound $m$ seen in training, but will not generalize
well beyond that. Moreover, simulating forms of
counting behavior⁴ in equation (7) requires consistently
setting the gates $z_t$, $r_t$ and the proposal $\tilde{h}_t$
to precise, non-saturated values, making it much
harder to find and maintain stable solutions.
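
The range restriction follows from equation (7) being a convex combination of two values in $[-1, 1]$. The sketch below illustrates this on one dimension with arbitrary gate and proposal values drawn at random; the sampling scheme and the name `gru_update` are assumptions made for the illustration only.

```python
import random

def gru_update(h_prev, z, h_tilde):
    """Equation (7) on one dimension: an interpolation between h_{t-1} and h~_t."""
    return z * h_prev + (1.0 - z) * h_tilde

random.seed(0)
h = 0.0
trace = []
for _ in range(10_000):
    z = random.random()                  # any gate value in [0, 1)
    h_tilde = random.uniform(-1.0, 1.0)  # any tanh output in [-1, 1]
    h = gru_update(h, z, h_tilde)
    trace.append(h)

print(max(abs(v) for v in trace) <= 1.0)
```

No choice of gate values lets the state leave $[-1, 1]$, so any count must be encoded within a bounded interval.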


Summary We show that LSTM and IRNN 
can implement unbounded counting in dedicated 
counting dimensions, while the GRU and SRNN 
cannot. This makes the LSTM and IRNN at least 
as strong as SKCMs, and strictly stronger than the 
SRNN and the GRU.⁵


5 Experimental Results 


Can the LSTM indeed learn to behave as a k- 
counter machine when trained using backpropa- 
gation? We show empirically that: 


1. LSTMs can be trained to recognize $a^n b^n$ and
$a^n b^n c^n$.


2. These LSTMs generalize to much higher $n$
than seen in the training set (though not infinitely so).


3. The trained LSTMs learn to use the per-dimension
counting mechanism.


4. GRUs can also be trained to recognize
$a^n b^n$ and $a^n b^n c^n$, but they do not have clear
counting dimensions, and they generalize to
much smaller $n$ than the LSTMs, often failing
to generalize correctly even for $n$ within
their training domain.


⁴ One such mechanism could be to divide a given dimension
by $k > 1$ at each symbol encounter, by setting $z_t = 1/k$
and $\tilde{h}_t = 0$. Note that the inverse operation would not be
implementable, and counting down would have to be realized
with a second counter.

⁵ One can argue that other counting mechanisms,
involving several dimensions, are also possible. Intuitively,
such mechanisms cannot be trained to perform unbounded
counting based on a finite sample as the model has no means
of generalizing the counting behavior to dimensions beyond
those seen in training. We discuss this more in depth in Appendix B,
where we also prove that an SRNN cannot represent
a binary counter.
5. Trained LSTM networks outperform trained
GRU networks on random test sets for the
languages $a^n b^n$ and $a^n b^n c^n$.


Similar empirical observations regarding the
ability of the LSTM to learn to recognize $a^n b^n$ and
$a^n b^n c^n$ are described also in (Gers and Schmidhuber,
2001).

We train 10-dimension, 1-layer LSTM and
GRU networks to recognize $a^n b^n$ and $a^n b^n c^n$. For
$a^n b^n$ the training samples went up to $n = 100$ and
for $a^n b^n c^n$ up to $n = 50$.⁶


Results On $a^n b^n$, the LSTM generalizes well up
to $n = 256$, after which it accumulates a deviation
making it reject $a^n b^n$ but recognize $a^n b^{n+1}$
for a while, until the deviation grows.⁷ The GRU
does not capture the desired concept even within
its training domain: accepting $a^n b^{n+1}$ for $n > 38$,
and also accepting $a^n b^{n+2}$ for $n > 97$. It stops
accepting $a^n b^n$ for $n > 198$.

On $a^n b^n c^n$ the LSTM recognizes well until $n = 100$.
It then starts accepting also $a^n b^{n+1} c^n$. At
$n > 120$ it stops accepting $a^n b^n c^n$ and switches to
accepting $a^n b^{n+1} c^n$, until at some point the deviation
grows. The GRU accepts already $a^9 b^{10} c^{12}$,
and stops accepting $a^n b^n c^n$ for $n > 63$.

Figure 1a plots the activations of the 10 dimensions
of the $a^n b^n$-LSTM for the input $a^{1000} b^{1000}$.
While the LSTM misclassifies this example, the
use of the counting mechanism is clear. Figure 1b
plots the activation for the $a^n b^n c^n$ LSTM
on $a^{100} b^{100} c^{100}$. Here, again, the two counting
dimensions are clearly identified (indicating the
LSTM learned the canonical 2-counter solution),
although the slightly-imprecise counting also
starts to show. In contrast, Figures 1c and 1d
show the state values of the GRU-networks. The
GRU behavior is much less interpretable than the
LSTM. In the $a^n b^n$ case, some dimensions may be
performing counting within a bounded range, but
move to erratic behavior at around $t = 1750$ (the

⁶ Implementation in DyNet, using the SGD Optimizer.
Positive examples are generated by sampling n in the desired
range. For negative examples we sample 2 or 3 n values independently,
ensuring that at least one of them differs from
the others. We dedicate a portion of the examples as the dev
set, and train up to 100% dev set accuracy.

⁷ These fluctuations occur as the networks do not fully saturate
their gates, meaning the LSTM implements an imperfect
counter that accumulates small deviations during computation,
e.g.: increasing the counting dimension by 0.99 but
decreasing only by 0.98. Despite this, we see that its solution
remains much more robust than that found by the GRU:
the LSTM has learned the essence of the counting based
solution, but its implementation is imprecise.


network starts to misclassify on sequences much
shorter than that). The $a^n b^n c^n$ state dynamics are
even less interpretable.

Finally, we created 1000-sample test sets for
each of the languages. For $a^n b^n$ we used words
of the form $a^{n+i} b^{n+j}$ where $n \in \mathrm{rand}(0, 200)$
and $i, j \in \mathrm{rand}(-2, 2)$, and for $a^n b^n c^n$ we used
words of the form $a^{n+i} b^{n+j} c^{n+k}$ where
$n \in \mathrm{rand}(0, 150)$ and $i, j, k \in \mathrm{rand}(-2, 2)$. The
LSTM’s accuracy was 100% and 98.6% on $a^n b^n$
and $a^n b^n c^n$ respectively, as opposed to the GRU’s
87.0% and 86.9%, also respectively.
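
A test set of this shape could be generated along the following lines. This is a sketch under stated assumptions, not the original generation code: `rand` is taken to be uniform integer sampling with inclusive bounds, negative exponents are clamped to 0, the empty word is treated as a member of the language, and the names `make_anbn_sample` and `test_set` are hypothetical.

```python
import random

def make_anbn_sample(rng, max_n=200, max_offset=2):
    """Return (word, label) for a word a^{n+i} b^{n+j}; label is True iff the counts match."""
    n = rng.randint(0, max_n)
    i = rng.randint(-max_offset, max_offset)
    j = rng.randint(-max_offset, max_offset)
    num_a, num_b = max(0, n + i), max(0, n + j)   # clamp: exponents cannot be negative
    return "a" * num_a + "b" * num_b, num_a == num_b

rng = random.Random(0)
test_set = [make_anbn_sample(rng) for _ in range(1000)]
print(sum(label for _, label in test_set), "positive words out of", len(test_set))
```

Sampling small offsets around a shared $n$ produces negatives that are close to the language boundary, which is where an imprecise counter is most likely to fail.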

All of this empirically supports our result, 
showing that IBFP-LSTMs can not only theoret- 
ically implement “unbounded” counters, but also 
learn to do so in practice (although not perfectly), 
while IBFP-GRUs do not manage to learn proper 
counting behavior, even when allowing floating 
point computations. 


6 Conclusions 


We show that the IBFP-LSTM can model a real- 
time SKCM, both in theory and in practice. This 
makes it more powerful than the IBFP-SRNN 
and the IBFP-GRU, which cannot implement un- 
bounded counting and are hence restricted to rec- 
ognizing regular languages. The IBFP-IRNN can 
also perform input-dependent counting, and is 
thus more powerful than the IBFP-SRNN. 

We note that in addition to theoretical distinc- 
tions between architectures, it is important to con- 
sider also the practicality of different solutions: 
how easy it is for a given architecture to discover 
and maintain a stable behavior in practice. We 
leave further exploration of this question for fu- 
ture work. 
Appendix
A Simplified K-Counter Machines 


We use a simplified variant of the k-counter ma- 
chines (SKCM) defined in (Fischer et al., 1968), 
which has no autonomous states and makes clas- 
sification decisions based on a combination of its 
current state and counter values. This variant con- 
sumes input sequences on a symbol by symbol ba- 
sis, updating at each step its state and its coun- 
ters, the latter of which may be manipulated by 
increment, decrement, zero, or no-ops alone, and 
observed only by checking equivalence to zero. 
To define the transitions of this model and its accepting
configurations, we will introduce the following
notations:

Notations We define $z : \mathbb{Z}^k \to \{0,1\}^k$ as follows:
for every $n \in \mathbb{Z}^k$, for every $1 \le i \le k$,
$z(n)_i = 0$ iff $n_i = 0$ (this function masks
a set of integers such that only their zero-ness
is observed). For a vector of operations,
$o \in \{-1, +1, \times 0, \times 1\}^k$, we denote by $o(n)$ the
point-wise application of the operations to the vector
$n \in \mathbb{Z}^k$, e.g. for $o = (+1, \times 0, \times 1)$,
$o((5,2,3)) = (6,0,3)$.
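
These two notations translate directly into code. The sketch below is illustrative: operations are encoded as strings, which is a choice made here and not part of the definition, and `apply_ops` is a hypothetical name for $o(n)$.

```python
from typing import Sequence, Tuple

def z(n: Sequence[int]) -> Tuple[int, ...]:
    """Mask a counter vector: 0 where the counter is zero, 1 elsewhere."""
    return tuple(0 if v == 0 else 1 for v in n)

# Each operation is encoded as a string (an illustrative encoding).
OPS = {
    "+1": lambda v: v + 1,
    "-1": lambda v: v - 1,
    "x0": lambda v: 0,
    "x1": lambda v: v,
}

def apply_ops(o: Sequence[str], n: Sequence[int]) -> Tuple[int, ...]:
    """Point-wise application o(n) of an operation vector to a counter vector."""
    return tuple(OPS[op](v) for op, v in zip(o, n))

print(apply_ops(("+1", "x0", "x1"), (5, 2, 3)))  # (6, 0, 3)
print(z((6, 0, 3)))                              # (1, 0, 1)
```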

We now define the model. An SKCM is a tuple
$M = (\Sigma, Q, q_0, k, \delta, u, F)$ containing:


1. A finite input alphabet $\Sigma$

2. A finite state set $Q$

3. An initial state $q_0 \in Q$

4. $k \in \mathbb{N}$, the number of counters

5. A state transition function
$\delta : Q \times \Sigma \times \{0,1\}^k \to Q$

6. A counter update function⁸
$u : \Sigma \to \{-1, +1, \times 0, \times 1\}^k$

7. A set of accepting masked⁹ configurations
$F \subseteq Q \times \{0,1\}^k$

The set of configurations of an SKCM is the set
$C = Q \times \mathbb{Z}^k$, and the initial configuration is
$c_0 = (q_0, 0)$ (i.e., the counters are initiated to zero). The


⁸ We note that in this definition, the counter update function
depends only on the input symbol. In practice we see
that the LSTM is not limited in this way, and can also update
according to some state-input combinations, as can be seen
when it is taught, for instance, the language $a^n b a^n$. We do
not explore this here however, leaving a more complete characterization
of the learnable models to future work.

⁹ i.e., counters are observed only by zero-ness.


transitions of an SKCM are as follows: given a
configuration $c_t = (q, n)$ ($n \in \mathbb{Z}^k$) and input $w_t \in \Sigma$,
the next configuration of the SKCM is
$c_{t+1} = (\delta(q, w_t, z(n)), u(w_t)(n))$.

The language recognized by a k-counter machine
is the set of words $w$ for which the machine
reaches an accepting configuration: a configuration
$c = (q, n)$ for which $(q, z(n)) \in F$.

Note that while the counters can be and are increased
to various non-zero values, the transition
function $\delta$ and the accept/reject classification of
the configurations observe only their zero-ness.


A.1 Computational Power of SKCMs 


We show that the SKCM model can recognize
the context-free and context-sensitive languages
$a^n b^n$ and $a^n b^n c^n$, but not the context-free language
of palindromes, meaning its computational
power differs from the language classes defined
in the Chomsky hierarchy. Similar proofs appear
in (Fischer et al., 1968) for their variant of the k-counter
machine.


$a^n b^n$: We define the following SKCM over the
alphabet $\{a, b\}$:


1. $Q = \{q_a, q_b, q_r\}$

2. $q_0 = q_a$

3. $k = 1$

4. $u(a) = +1$, $u(b) = -1$

5. for any $z \in \{0,1\}$:
$\delta(q_a, a, z) = q_a$, $\delta(q_a, b, z) = q_b$,
$\delta(q_b, a, z) = q_r$, $\delta(q_b, b, z) = q_b$,
$\delta(q_r, a, z) = q_r$, $\delta(q_r, b, z) = q_r$

6. $F = \{(q_b, 0)\}$


The state $q_r$ is a rejecting sink state, and the states
$q_a$ and $q_b$ keep track of whether the sequence is
currently in the “a” or “b” phase. If an a is seen
after moving to the b phase, the machine moves
to (and stays in) the rejecting state. The counter is
increased on input a and decreased on input b, and
the machine accepts only sequences that reach the
state $q_b$ with counter value zero, i.e., that have increased
and decreased the counter an equal number
of times, without switching from b to a. It follows
easily that this machine recognizes exactly
the language $a^n b^n$.
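
The machine above can be simulated in a few lines. The sketch below is an illustration written for this purpose: state names are plain strings and `accepts_anbn` is a hypothetical helper. Since acceptance requires reaching the state for the b phase, the empty word is rejected.

```python
def accepts_anbn(word: str) -> bool:
    """Run the one-counter SKCM for a^n b^n described above on `word`."""
    u = {"a": +1, "b": -1}                          # counter update function
    delta = {("qa", "a"): "qa", ("qa", "b"): "qb",
             ("qb", "a"): "qr", ("qb", "b"): "qb",
             ("qr", "a"): "qr", ("qr", "b"): "qr"}  # transitions ignore the mask z
    state, counter = "qa", 0                        # initial configuration (q_a, 0)
    for symbol in word:
        state = delta[(state, symbol)]
        counter += u[symbol]
    return state == "qb" and counter == 0           # accepting masked configuration (q_b, 0)

print(accepts_anbn("aaabbb"), accepts_anbn("aabbb"), accepts_anbn("abab"))
```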


$a^n b^n c^n$: We define the following SKCM over
the alphabet $\{a, b, c\}$. As its state transition function
ignores the counter values, we use the shorthand
$\delta(q, \sigma)$ for $\delta(q, \sigma, z)$, for all $z \in \{0,1\}^2$.


1. $Q = \{q_a, q_b, q_c, q_r\}$

2. $q_0 = q_a$

3. $k = 2$

4. $u(a) = (+1, \times 1)$,
$u(b) = (-1, +1)$,
$u(c) = (\times 1, -1)$

5. for any $z \in \{0,1\}^2$:
$\delta(q_a, a) = q_a$, $\delta(q_a, b) = q_b$, $\delta(q_a, c) = q_r$,
$\delta(q_b, a) = q_r$, $\delta(q_b, b) = q_b$, $\delta(q_b, c) = q_c$,
$\delta(q_c, a) = q_r$, $\delta(q_c, b) = q_r$, $\delta(q_c, c) = q_c$,
$\delta(q_r, a) = q_r$, $\delta(q_r, b) = q_r$, $\delta(q_r, c) = q_r$

6. $F = \{(q_c, 0, 0)\}$


By similar reasoning as that for $a^n b^n$, we see
that this machine recognizes exactly the language
$a^n b^n c^n$. We note that this construction can be extended
to build an SKCM for any language of the
sort $a_1^n a_2^n \ldots a_m^n$, using $k = m - 1$ counters and
$k + 2$ states ($m$ phase states plus the rejecting sink).
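
The extension to $m$ symbols can be sketched as follows. This is an illustrative simulation, not a formal SKCM: the phase states are represented by an integer index, the rejecting sink by `-1`, counters are no longer updated once the sink is reached (which does not affect acceptance), and `accepts_equal_blocks` is a hypothetical name.

```python
def accepts_equal_blocks(word: str, alphabet: str = "abc") -> bool:
    """SKCM-style recognizer for a_1^n a_2^n ... a_m^n with m - 1 counters (n >= 1)."""
    m = len(alphabet)
    counters = [0] * (m - 1)
    phase = 0                       # index of the current symbol block; -1 is the rejecting sink
    for symbol in word:
        idx = alphabet.find(symbol)
        if phase == -1 or idx not in (phase, phase + 1):
            phase = -1              # out-of-order or unknown symbol: move to the sink
            continue
        phase = idx
        if idx < m - 1:
            counters[idx] += 1      # each symbol but the last increments its own counter
        if idx > 0:
            counters[idx - 1] -= 1  # each symbol but the first decrements the previous counter
    return phase == m - 1 and all(c == 0 for c in counters)

print(accepts_equal_blocks("aabbcc"), accepts_equal_blocks("aabbc"))
```

With `alphabet="abc"` the counter updates match $u(a)$, $u(b)$ and $u(c)$ above: the first counter tracks the difference between the numbers of a's and b's, and the second the difference between b's and c's.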


Palindromes: We prove that no SKCM can recognize
the language of palindromes defined over
the alphabet $\{a, b, x\}$ by the grammar
$S \to x \mid aSa \mid bSb$. The intuition is that in order to correctly
recognize this language in a one-way setting,
one must be able to reach a unique configuration
for every possible input sequence over $\{a, b\}$
(requiring an exponential number of reachable
configurations), whereas for any SKCM, the number
of reachable configurations is always polynomial
in the input length.¹⁰

Let $M$ be an SKCM with $k$ counters. As its
counters are only manipulated by steps of 1 or resets,
the maximum and minimum values that each
counter can attain on any input $w \in \Sigma^*$ are $+|w|$
and $-|w|$, and in particular the total number of
possible values a counter could reach at the end
of input $w$ is $2|w| + 1$. This means that the total
number of possible configurations $M$ could reach
on input of length $n$ is at most $c(n) = |Q| \cdot (2n + 1)^k$.

$c(n)$ is polynomial in $n$, and so there exists a
value $m$ for which the number of input sequences
of length $m$ over $\{a, b\}$, namely $2^m$, is greater than
$c(m)$. It follows by the pigeonhole principle that
there exist two input sequences $w_1 \ne w_2 \in \{a, b\}^m$
for which $M$ reaches the same configuration.
This means that for any suffix $w \in \Sigma^*$,
and in particular for $w = x \cdot w_1^{-1}$ where $w_1^{-1}$ is
the reverse of $w_1$, $M$ classifies $w_1 \cdot w$ and $w_2 \cdot w$

¹⁰ This will hold even if the counter update function can
rely on any state-input combination.


identically, despite the fact that $w_1 \cdot x \cdot w_1^{-1}$ is in
the language and $w_2 \cdot x \cdot w_1^{-1}$ is not. This means
that $M$ necessarily does not recognize this palindrome
language, and ultimately that no such $M$
exists.

Note that this proof can be easily generalized to 
any palindrome grammar over 2 or more charac- 
ters, with or without a clear ‘midpoint’ marker. 
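
The counting argument can be made concrete by computing where the exponential number of prefixes overtakes the polynomial bound on configurations. The sketch below is illustrative; the function names and the example machine size (4 states, 2 counters) are assumptions.

```python
def max_configurations(num_states: int, k: int, n: int) -> int:
    """Upper bound c(n) = |Q| * (2n + 1)^k on configurations reachable after n symbols."""
    return num_states * (2 * n + 1) ** k

def first_collision_length(num_states: int, k: int) -> int:
    """Smallest m with 2^m > c(m): from there on, two prefixes over {a, b} must collide."""
    m = 1
    while 2 ** m <= max_configurations(num_states, k, m):
        m += 1
    return m

m = first_collision_length(num_states=4, k=2)
print(m, 2 ** m, max_configurations(4, 2, m))
```

For any fixed number of states and counters the loop terminates, since $2^m$ eventually exceeds every polynomial in $m$.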


B Impossibility of Counting in Binary 


While we have seen that the SRNN and GRU can- 
not allocate individual counting dimensions, the 
question remains whether they can count using a 
more elaborate mechanism, perhaps over several 
dimensions. We show here that one such mecha- 
nism — a binary counter — is not implementable 
in the SRNN. 

For the purposes of this discussion, we first de- 
fine a binary counter in an RNN. 


Binary Interpretation In an RNN with hidden
state values in the range $(-1, 1)$, the binary interpretation
of a sequence of dimensions $d_1, \ldots, d_n$
of its hidden state is the binary number obtained
by replacing each positive hidden value in the sequence
with a ‘1’ and each negative value with
a ‘0’. For instance: the binary interpretation of
the dimensions 3,0,1 in the hidden state vector
$(0.5, -0.1, 0.3, 0.8)$ is 110, i.e., 6.
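
The definition can be written as a short function. In this sketch the name `binary_interpretation` is hypothetical, dimensions are zero-indexed as in the example above, and a value of exactly 0 is mapped to bit 0, a case the definition does not cover.

```python
from typing import Sequence

def binary_interpretation(h: Sequence[float], dims: Sequence[int]) -> int:
    """Read the selected hidden dimensions as bits (positive -> 1, otherwise 0), MSB first."""
    value = 0
    for d in dims:
        value = 2 * value + (1 if h[d] > 0 else 0)
    return value

print(binary_interpretation((0.5, -0.1, 0.3, 0.8), (3, 0, 1)))  # 6
```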


Binary Counting We say that the dimensions
$d_1, d_2, \ldots, d_n$ in an RNN’s hidden state implement
a binary counter in the RNN if, in every transition,
their binary interpretation either increases,
decreases, resets to 0, or doesn’t change.¹¹

A similar pair of definitions can be made for 
state values in the range (0, 1). 

We first note intuitively that an SRNN would 
not generalize binary counting to a counter with 
dimensions beyond those seen in training — as it 
would have no reason to learn the ‘carry’ behav- 
ior between the untrained dimensions. We prove 
further that we cannot reasonably implement such 
counters regardless. 

We now present a proof sketch that a single-layer
SRNN with hidden size $n \ge 3$ cannot implement
an n-dimensional binary counter that will
consistently increase on one of its input symbols.
After this, we will prove that even with helper 


¹¹ We note that the SKCMs presented here are more restricted
in their relation between counter action and transition,
but prefer here to give a general definition. Our proof
will be relevant even within the restrictions.
dimensions, we cannot implement a counter that
will consistently increase on one input token and
decrease on another, as we might want in order
to classify the language of all words $w$ for which
$\#_a(w) = \#_b(w)$.¹²

Consistently Increasing Counter: The proof relies
on the linearity of the affine transform $Wx + Uh + b$,
and the fact that ‘carry’ is a non-linear
operation. We work with state values in the range
$(-1, 1)$, but the proof can easily be adapted to
$(0, 1)$ by rewriting $h$ as $h' + 0.5$, where $h' = h - 0.5$
is a vector with values in the range $(-0.5, 0.5)$.
Suppose we have a single-layer SRNN with
hidden size $n = 3$, such that its entire hidden
state represents a binary counter that increases
every time it receives the input symbol $a$. We
denote by $x_a$ the embedding of $a$, and assume
w.l.o.g. that the hidden state dimensions are ordered
from MSB to LSB, e.g. the hidden state
vector $(1, 1, -1)$ represents the number $110 = 6$.

Recall that the binary interpretation of the hid- 
den state relies only on the signs of its values. We 
use p and n to denote ‘some’ positive or negative 
value, respectively. Then the number 6 can be rep- 
resented by any state vector (p, p, n). 

Recall also that the SRNN state transition is


$h_t = \tanh(W x_t + U h_{t-1} + b)$


and consider the state vectors $(-1, 1, 1)$ and
$(1, -1, -1)$, which represent 3 and 4 respectively.
Denoting $\tilde{b} = W x_a + b$, we find that the constants
$U$ and $\tilde{b}$ must satisfy:


$\tanh(U(-1, 1, 1) + \tilde{b}) = (p, n, n)$
$\tanh(U(1, -1, -1) + \tilde{b}) = (p, n, p)$


As tanh is sign-preserving, this simplifies to:


$U(-1, 1, 1) = (p, n, n) - \tilde{b}$
$U(1, -1, -1) = (p, n, p) - \tilde{b}$


Noting the linearity of matrix multiplication and
that $(1, -1, -1) = -(-1, 1, 1)$, we obtain:


$U(-1, 1, 1) = U(-(1, -1, -1)) = -U(1, -1, -1)$
$(p, n, n) - \tilde{b} = \tilde{b} - (p, n, p)$


i.e. for some assignment to each $p$ and $n$,
$2\tilde{b} = (p, n, n) + (p, n, p)$, and in particular $\tilde{b}[1] < 0$.

Similarly, for $(-1, -1, 1)$ and $(1, 1, -1)$, we obtain


$U(-1, -1, 1) = (n, p, n) - \tilde{b}$
$U(1, 1, -1) = (p, p, p) - \tilde{b}$


i.e.


$(n, p, n) - \tilde{b} = \tilde{b} - (p, p, p)$


or $2\tilde{b} = (p, p, p) + (n, p, n)$, and in particular that
$\tilde{b}[1] > 0$, leading to a contradiction and proving
that such an SRNN cannot exist. The argument
trivially extends to $n > 3$ (by padding from the
MSB).

¹² Of course a counter could also be ‘decreased’ by incrementing
a parallel, ‘negative’ counter, and implementing
compare-to-zero as a comparison between these two. As intuitively
no RNN could generalize binary counting behavior
to dimensions not used in training, this approach could
quickly find both counters outside of their learned range even
on a sequence where the difference between them is never
larger than in training.

We note that this proof does not extend to the 
case where additional, non counting dimensions 
are added to the RNN — at least not without fur- 
ther assumptions, such as the assumption that the 
counter behave correctly for all values of these di- 
mensions, reachable and unreachable. One may 
argue then that, with enough dimensions, it could 
be possible to implement a consistently increasing 
binary counter on a subset of the SRNN’s state.¹³
We now show a counting mechanism that cannot 
be implemented even with such ‘helper’ dimen- 
sions. 

Bi-Directional Counter: We show that for $n \ge 3$,
no SRNN can implement an n-dimensional binary
counter that increases for one token, $\sigma_{up}$, and
decreases for another, $\sigma_{down}$. As before, we show
the proof explicitly for $n = 3$, and note that it can
be simply expanded to any $n > 3$ by padding.

Assume by contradiction we have such an
SRNN, with $m \ge 3$ dimensions, and assume
w.l.o.g. that a counter is encoded along the first
3 of these. We use the shorthand $(v_1, v_2, v_3)c$
to show the values of the counter dimensions
explicitly while abstracting the remaining state
dimensions, e.g. we write the hidden state
$(-0.5, 0.1, 1, 1, 1)$ as $(-0.5, 0.1, 1)c$ where $c = (1, 1)$.

Let $x_{up}$ and $x_{down}$ be the embeddings of $\sigma_{up}$
and $\sigma_{down}$, and as before denote $b_{up} = W x_{up} + b$
and $b_{down} = W x_{down} + b$. Then for some reachable
state $h_1 \in \mathbb{R}^m$ where the counter value is
1 (e.g., the state reached on the input sequence
$\sigma_{up}$),¹⁴ we find that the constants $U$, $b_{down}$, and

¹³ (By storing processing information on the additional,
‘helper’ dimensions)

¹⁴ (Or whichever appropriate sequence if the counter is not
initiated to zero.)
$b_{up}$ must satisfy:


$\tanh(U h_1 + b_{up}) = (n, p, n)c_1$
$\tanh(U h_1 + b_{down}) = (n, n, n)c_2$


(i.e., $\sigma_{up}$ increases the counter and updates the additional
dimensions to the values $c_1$, while $\sigma_{down}$
decreases and updates to $c_2$.) Removing the sign-preserving
function tanh we obtain the constraints


$U h_1 + b_{up} = (n, p, n)\,\mathrm{sign}(c_1)$
$U h_1 + b_{down} = (n, n, n)\,\mathrm{sign}(c_2)$


i.e. $(b_{up} - b_{down})[0:2] = (n, p, n) - (n, n, n)$,
and in particular $(b_{up} - b_{down})[1] > 0$. Now consider
a reachable state $h_3$ for which the counter
value is 3. Similarly to before, we now obtain


$U h_3 + b_{up} = (p, n, n)\,\mathrm{sign}(c_3)$
$U h_3 + b_{down} = (n, p, n)\,\mathrm{sign}(c_4)$


from which we get $(b_{up} - b_{down})[0:2] =
(p, n, n) - (n, p, n)$, and in particular
$(b_{up} - b_{down})[1] < 0$, a contradiction to the previous
statement. Again we conclude that no such SRNN
can exist.